Law Enforcement & Public Safety
A Details of the Experiments
A.1 Details of the Datasets
Here we introduce the details of the datasets used in the experiments.
Table 6: Dataset Summary
COMPAS: COMPAS [16] is a dataset containing the criminal records of 6,172 individuals arrested in Florida. The task is to predict whether an individual will commit another crime within 2 years. The probability predicted by the system is used as a risk score. We use 13 attributes for prediction.
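As a concrete illustration of how such a risk score can be produced, below is a minimal sketch that trains a logistic-regression recidivism classifier on a COMPAS-style table and uses the predicted probability as the risk score. The file name, the label column `two_year_recid`, and the preprocessing choices are assumptions for illustration, not the exact pipeline used in the experiments.

```python
# Minimal sketch: a logistic-regression risk score on a COMPAS-style table.
# File name and column names are assumed for illustration; the actual
# experiments may use different preprocessing and features.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("compas.csv")                      # assumed local copy of the dataset
y = df["two_year_recid"]                            # 1 if re-arrested within 2 years
X = df.drop(columns=["two_year_recid"])             # the 13 predictive attributes

cat_cols = X.select_dtypes(include="object").columns.tolist()
num_cols = X.select_dtypes(exclude="object").columns.tolist()

model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)

# The predicted probability of recidivism is used directly as the risk score.
risk_scores = model.predict_proba(X_te)[:, 1]
```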
Long-form factuality in large language models
Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality.
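To make the proposed aggregation concrete, here is a hedged sketch of an F1@K-style metric in the spirit of the description above: precision is the fraction of a response's checked facts that are supported, and recall is measured against a chosen target number of facts K. The function name and the exact recall definition (capping supported facts at K) are a reconstruction from this description, not a verbatim reproduction of the paper's metric.

```python
# Sketch of an F1@K-style aggregate for long-form factuality.
# supported: number of facts in the response judged supported by search results
# not_supported: number of facts judged not supported
# k: target number of supported facts a user cares about (recall hyperparameter)
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)

# Example: a response with 40 supported and 10 unsupported facts, K = 64.
print(f1_at_k(40, 10, 64))  # ≈ 0.70
```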
Did faulty drug tests taint parole hearings? California is reviewing hundreds of denials
The California Department of Corrections and Rehabilitation is reviewing hundreds of state parole hearings to see if any inmates who were denied parole were rejected because of faulty drug tests. Nearly 6,000 drug tests in California prisons are believed to have yielded false positives between April and July last year, and attorneys for the Board of Parole are now conducting a review of inmate files to determine if any of them need to appear before the parole board again to be reconsidered, according to officials with CDCR. If any inmates were denied parole because of the faulty tests, they could be owed a new hearing before the parole board, said attorneys representing inmates affected by the defective drug tests. The review is already underway and will determine if "without the positive drug screening, there is sufficient evidence to support an incarcerated person's denial of parole," said CDCR spokesperson Emily Humpal in a statement. If there isn't enough evidence to support incarceration other than the drug test, a new hearing will be scheduled.
Order-Independence Without Fine Tuning
Reid McIlroy-Young, Katrina Brown, Conlan Olson, Linjun Zhang, Cynthia Dwork
The development of generative language models that can create long and coherent textual outputs via autoregression has led to a proliferation of uses and a corresponding sweep of analyses as researchers work to determine the limitations of this new paradigm. Unlike humans, these 'Large Language Models' (LLMs) are highly sensitive to small changes in their inputs, leading to unwanted inconsistency in their behavior. One problematic inconsistency when LLMs are used to answer multiple-choice questions or analyze multiple inputs is order dependency: the output of an LLM can (and often does) change significantly when sub-sequences are swapped, despite both orderings being semantically identical. In this paper we present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences. We show that this method provably eliminates order dependency and that it can be applied to any transformer-based LLM to enable text generation that is unaffected by re-orderings. Delving into the implications of our method, we show that, despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is taken over uniformly chosen orderings of the candidate responses, and is usually significantly smaller in practice. Thus, Set-Based Prompting can be used as a 'drop-in' method on fully trained models. Finally, we discuss how our method's success suggests that other strong guarantees can be obtained on LLM performance via modifying the input representations.
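The core mechanism behind such order-independent prompting is to give the parallel sub-sequences identical positional indices and to mask attention between them, so no ordering information distinguishes one sub-sequence from another. Below is a minimal sketch of that idea for a decoder-only transformer; the helper name `set_based_inputs` and the exact masking and position scheme are a simplified reconstruction from the description above, not the authors' implementation.

```python
# Sketch: order-independent encoding of parallel sub-sequences.
# Each sub-sequence attends to the shared prefix and to itself, but not to the
# other sub-sequences, and all sub-sequences reuse the same position indices,
# so swapping them cannot change the model's view of the input.
from typing import List, Tuple
import torch

def set_based_inputs(
    prefix: List[int],
    options: List[List[int]],   # the parallel sub-sequences (e.g. answer choices)
    suffix: List[int],
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    input_ids, position_ids, spans = [], [], []
    # Shared prefix: ordinary causal positions 0..len(prefix)-1.
    input_ids += prefix
    position_ids += list(range(len(prefix)))

    # Parallel sub-sequences: each restarts its positions at len(prefix).
    for opt in options:
        start = len(input_ids)
        input_ids += opt
        position_ids += list(range(len(prefix), len(prefix) + len(opt)))
        spans.append((start, len(input_ids)))

    # Suffix: positions continue after the longest sub-sequence.
    max_opt = max(len(o) for o in options)
    input_ids += suffix
    position_ids += list(range(len(prefix) + max_opt,
                               len(prefix) + max_opt + len(suffix)))

    n = len(input_ids)
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal baseline
    # Remove attention between different sub-sequences.
    for i, (s1, e1) in enumerate(spans):
        for j, (s2, e2) in enumerate(spans):
            if i != j:
                mask[s1:e1, s2:e2] = False

    return torch.tensor(input_ids), torch.tensor(position_ids), mask
```

Because the returned position ids and attention mask are symmetric in the options, permuting `options` yields an equivalent encoding; the mask can be converted to a model's expected attention-mask format and passed alongside the custom `position_ids`.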
96% of IT pros say AI agents are a security risk, but they're deploying them anyway
AI agents are being rapidly deployed within organizations even as they sow security fears, according to a new report from data governance firm SailPoint. Based on a global survey of more than 350 IT professionals, the report found that the widespread embrace of agents -- AI systems capable of formulating plans and taking action without human oversight -- is taking place within a security vacuum. Of IT pros who responded, 84% said their organizations already use agents internally, but just over half that number (44%) currently have policies in place to control the agents' behavior. Even more strikingly, 96% of respondents said they view agents as a security risk, yet 98% also said their employers plan to expand their use of agents in the coming year. Agents are the latest wave in a flood of innovation surrounding generative AI, which began in earnest following OpenAI's release of ChatGPT in late 2022.
T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models
The recent development of Sora has ushered in a new era of text-to-video (T2V) generation. Along with it comes rising concern about safety risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover limited aspects and do not address the unique temporal risks inherent in video generation.
BendVLM: Test-Time Debiasing of Vision-Language Embeddings
Walter Gerych, Eileen Pan
Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making debiasing VLM embeddings crucial. Debiasing approaches that fine-tune the VLM often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically utilize a "one-size-fits-all" approach that assumes that correlation with the spurious attribute can be explained using a single linear direction across all possible inputs.
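For context on the "single linear direction" baseline that this line of work improves upon, below is a hedged sketch of that conventional approach: estimate one spurious-attribute direction from attribute prompts and project it out of every embedding. The inputs are assumed to be precomputed, L2-normalized embeddings from a CLIP-style encoder; this illustrates the baseline being critiqued, not the BendVLM method itself, which instead adapts the debiasing to each input at test time.

```python
# Sketch of the "one-size-fits-all" linear debiasing baseline: remove a single
# spurious-attribute direction from all embeddings. Inputs are assumed to be
# precomputed, L2-normalized VLM embeddings (e.g. from a CLIP-style encoder).
import numpy as np

def linear_debias(embeddings: np.ndarray, attr_text_embs: np.ndarray) -> np.ndarray:
    """embeddings: (n, d) image/text embeddings to debias.
    attr_text_embs: (2, d) embeddings of two attribute prompts,
    e.g. "a photo of a man" vs. "a photo of a woman"."""
    direction = attr_text_embs[0] - attr_text_embs[1]
    direction = direction / np.linalg.norm(direction)
    # Project out the attribute direction from every embedding.
    debiased = embeddings - np.outer(embeddings @ direction, direction)
    # Re-normalize so cosine similarities remain comparable.
    return debiased / np.linalg.norm(debiased, axis=1, keepdims=True)
```

Because the same `direction` is removed from every input, this baseline cannot adapt to examples whose spurious correlation is not captured by that one direction, which is the limitation test-time debiasing targets.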
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Andy Zhou
Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels. Furthermore, we propose a jailbreak method, JAM (Jailbreak Against Moderation), designed to attack moderation guardrails using jailbreak prefixes to bypass input-level filters and a fine-tuned shadow model functionally equivalent to the guardrail model to generate cipher characters to bypass output-level filters. Our extensive experiments on four LLMs demonstrate that JAM achieves higher jailbreak success (about 19.88× that of baselines) and lower filtered-out rates (about 1/6 of baselines).